
    B-tree indexes for high update rates

    In some applications, data capture dominates query processing. For example, monitoring moving objects often requires more insertions and updates than queries. Data gathering using automated sensors often exhibits this imbalance. More generally, indexing streams is apparently considered an unsolved problem. For those applications, B-tree indexes are reasonable choices if some trade-off decisions are tilted towards optimization of updates rather than of queries. This paper surveys techniques that let B-trees sustain very high update rates, up to multiple orders of magnitude higher than traditional B-trees, at the expense of query processing performance. Perhaps not surprisingly, some of these techniques are reminiscent of those employed during index creation, index rebuild, etc., while others are derived from other well-known technologies such as differential files and log-structured file systems.
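
    One family of techniques the abstract names, differential files, can be illustrated with a toy sketch. The Python below is a hypothetical illustration, not the paper's design: the class name, the threshold, and the sorted list standing in for the B-tree are all assumptions. Updates are captured in an in-memory differential buffer in O(1) and migrated to the index in sorted batches, so tree maintenance cost is amortized over many updates while queries pay an extra probe of the buffer.

        import bisect

        class BufferedIndex:
            """Toy differential-file front end for an index.
            Handles insertions and updates only; deletions would
            need tombstones in a fuller sketch."""

            def __init__(self, threshold=1024):
                self.main = []       # stand-in for the B-tree's sorted leaves
                self.buffer = {}     # differential file: key -> latest value
                self.threshold = threshold

            def upsert(self, key, value):
                self.buffer[key] = value          # O(1) capture, no tree I/O
                if len(self.buffer) >= self.threshold:
                    self._migrate()

            def _migrate(self):
                # One sorted bulk merge amortizes maintenance cost over
                # many updates, much like index creation from a sorted stream.
                merged = dict(self.main)
                merged.update(self.buffer)
                self.main = sorted(merged.items())
                self.buffer.clear()

            def lookup(self, key):
                # Queries pay the trade-off: probe the buffer, then the index.
                if key in self.buffer:
                    return self.buffer[key]
                i = bisect.bisect_left(self.main, (key,))
                if i < len(self.main) and self.main[i][0] == key:
                    return self.main[i][1]
                return None

    The sorted bulk migration is the point of the sketch: it resembles index creation from a sorted stream, which is exactly the connection the abstract draws.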

    Robust and Efficient Sorting with Offset-Value Coding

    Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup; and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding. In the process, we happened upon its mutually beneficial relationship with prefix truncation in run files as well as the duality of compression techniques in row- and column-format storage structures, namely prefix truncation and run-length encoding of leading key columns. We also found a beneficial relationship with consumers of sorted streams, e.g., merging parallel streams, in-stream aggregation, and merge join. We report on our implementation in the context of Google's Napa and F1 Query systems, as well as an experimental evaluation of performance and scalability.
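
    As background for the abstract above, the following sketch shows the core idea of offset-value coding in a two-way merge: each candidate key carries an integer code relative to the last key emitted, so most comparisons resolve on a single integer and key columns are inspected only when two codes collide. The key shape, packing scheme, and all names are illustrative assumptions, not the Napa/F1 implementation; real sorters precompute the in-run codes when a run is written.

        ARITY = 4          # fixed number of key columns (an assumption)
        DOMAIN = 1 << 16   # per-column value range used for packing (an assumption)

        def code(base, key):
            # OVC of `key` relative to the `base` key preceding it in sort
            # order: offset = first column where they differ, value = key's
            # column there, packed so a larger code implies a larger key.
            offset = 0
            while offset < ARITY and key[offset] == base[offset]:
                offset += 1
            if offset == ARITY:
                return 0                    # duplicate of base
            return (ARITY - offset) * DOMAIN + key[offset]

        def merge(run_a, run_b):
            # Merge two ascending runs of ARITY-column integer tuples.
            out = []
            last = (0,) * ARITY             # sentinel no larger than any key
            a, b = iter(run_a), iter(run_b)
            ka, kb = next(a, None), next(b, None)
            ca = None if ka is None else code(last, ka)
            cb = None if kb is None else code(last, kb)
            while ka is not None and kb is not None:
                if ca != cb:                # decided without touching key columns
                    if ca < cb:             # smaller code => smaller key
                        out.append(ka)
                        ka = next(a, None)  # kb's code stays valid vs. the new last
                        ca = None if ka is None else code(out[-1], ka)
                    else:
                        out.append(kb)
                        kb = next(b, None)
                        cb = None if kb is None else code(out[-1], kb)
                else:                       # collision: compare past the shared prefix
                    j = ARITY if ca == 0 else ARITY - ca // DOMAIN + 1
                    while j < ARITY and ka[j] == kb[j]:
                        j += 1
                    if j >= ARITY or ka[j] < kb[j]:
                        out.append(ka)      # re-code the loser against the winner
                        cb = 0 if j >= ARITY else (ARITY - j) * DOMAIN + kb[j]
                        ka = next(a, None)
                        ca = None if ka is None else code(out[-1], ka)
                    else:
                        out.append(kb)
                        ca = 0 if j >= ARITY else (ARITY - j) * DOMAIN + ka[j]
                        kb = next(b, None)
                        cb = None if kb is None else code(out[-1], kb)
            while ka is not None:           # drain whichever run remains
                out.append(ka)
                ka = next(a, None)
            while kb is not None:
                out.append(kb)
                kb = next(b, None)
            return out

    Note how the else branch also exhibits the relationship with prefix truncation mentioned in the abstract: the columns shared with the last output key are never re-read.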

    Sort-based grouping and aggregation

    Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors including input and output sizes, the sort order of the input, and the need for sorted output. For example, hash-based aggregation is ideal for small output (e.g., TPC-H Query 1), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join. Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. To address this challenge, this paper introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system's only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google's F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.
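
    To make the abstract's comparison concrete, here is a minimal sketch of in-stream aggregation, the sorted-input case described as most efficient by far. The function name and the (key, value) record shape are illustrative assumptions. One pass and constant state per group suffice because equal keys arrive adjacently in sorted input.

        def in_stream_sum(sorted_records):
            """Yield (group_key, sum) for input already sorted by group key."""
            current_key, total = None, 0
            for key, value in sorted_records:
                if key != current_key:
                    if current_key is not None:
                        yield current_key, total    # group boundary reached
                    current_key, total = key, 0
                total += value
            if current_key is not None:
                yield current_key, total            # flush the final group

        # Example: groups "a" and "b" each finish as soon as the key changes.
        list(in_stream_sum([("a", 1), ("a", 2), ("b", 5)]))  # [("a", 3), ("b", 5)]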

    Database Workload Management (Dagstuhl Seminar 12282)

    This report documents the program and the outcomes of Dagstuhl Seminar 12282, "Database Workload Management". The seminar was designed to provide a venue where researchers could engage in dialogue with industrial participants to explore challenging industrial workloads in depth, where industrial participants could challenge researchers to apply the lessons learned from their large-scale experiments to multiple real systems, and which would facilitate the release of real workloads to drive future research, along with concrete measures for evaluating and comparing workload management techniques in the context of these workloads.
    • …